library(dplyr)
library(MASS)
library(imputeTS)
library(rpart)
library(glmnet)
library(zoo)
library(formattable)
library(kableExtra)
library(chron)
library(tidyverse)
library(caret)
library(rpart.plot)
library(randomForest)
library(lmtest)
library(gridExtra)
library(xgboost)
library(faraway)
library(nlme)
library(corrplot)

Abstract

The main objective of this project is to predict the sales of 111 potentially weather-sensitive products (such as umbrellas, bread, and milk) around the time of major weather events at 45 of Walmart's retail locations, using statistical and machine learning techniques. This is an important problem for any big retailer, as it helps the replenishment manager correctly plan and manage inventory to avoid being out of stock or overstocked during and after a storm. It also supports timely intervention to manage risks arising from other unforeseen circumstances.

Introduction

Walmart is an American multinational retail corporation that operates a chain of hypermarkets, discount department stores, and grocery stores, headquartered in Bentonville, Arkansas. The company was founded by Sam Walton in 1962 and incorporated on October 31, 1969. It also owns and operates Sam’s Club retail warehouses. As of October 31, 2019, Walmart has 11,438 stores and clubs in 27 countries, operating under 55 different names. The company operates under the name Walmart in the United States and Canada, as Walmart de México y Centroamérica in Mexico and Central America, as Asda in the United Kingdom, as the Seiyu Group in Japan, and as Best Price in India. It has wholly owned operations in Argentina, Chile, Canada, and South Africa. Since August 2018, Walmart only holds a minority stake in Walmart Brasil, which was renamed Grupo Big in August 2019, with 20 percent of the company’s shares, and private equity firm Advent International holding 80 percent ownership of the company.

Inventory Management: Inventory management is a tricky function of any business, and especially for consumer-facing businesses, keeping the inventory of each stock keeping unit (SKU) at an optimal quantity is one of the critical parts of the operation. Holding more than the optimal quantity creates overhead costs that bring down profitability, while holding less can cause consumers to leave the store and buy elsewhere. Hence, it becomes a key function for any consumer-facing business to maintain an optimal inventory for each SKU. The problem of inventory management boils down to one important metric: the demand for a particular item, which is difficult to predict without a scientific method and time consuming to estimate. Therefore, the goal for a firm is to predict the demand for its SKUs and maintain inventory accordingly.

Problem faced by Walmart: With more than 11,000 stores across different geographies, inventory management becomes a necessary task for Walmart to focus on its customers' needs and to make sure the right products are available at the right place in the right quantity. The prediction problem at hand concerns a subset of SKUs, weather-sensitive items like umbrellas and bread, around the time of major weather events, which makes the demand even more variable and harder to predict. Therefore, the task is to use analytical techniques to predict the demand for these essential SKUs during major weather events.

 

 

Methods

Data

The data has been downloaded from the Kaggle website. The dataset contains sales data for 111 products whose sales may be affected by weather (such as milk, bread, and umbrellas). These 111 products are sold in stores at 45 different Walmart locations. Some of the products may be essentially the same item (such as milk) but have different ids in different stores/regions/suppliers. The 45 locations are covered by 20 weather stations (i.e., some of the stores are nearby and share a weather station).

For the purposes of this project, a weather event is defined as any day in which more than an inch of rain or two inches of snow was observed. The prediction is the units sold for a window of ±3 days surrounding each storm.
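
The event definition above can be sketched in R. This is a minimal sketch, assuming a `weather` dataframe with a `date` column, `preciptotal` in inches of rain, and `snowfall` in inches of snow (trace values already converted to numeric):

```r
library(dplyr)

# A day is a weather event if rain exceeds 1 inch or snow exceeds 2 inches
weather <- weather %>%
  mutate(event = (preciptotal > 1) | (snowfall > 2))

# Expand each event date into its +/- 3 day prediction window
event_dates  <- weather %>% filter(event) %>% pull(date)
window_dates <- sort(unique(do.call(c, lapply(as.Date(event_dates), function(d)
  seq(d - 3, d + 3, by = "day")))))
```

Units sold are then predicted only for dates in `window_dates`.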

The following graphic shows the layout of the test windows. The green dots are the training set days, the red dots are the test set days, and event=True marks the days with storms. Note that this plot covers the 20 weather stations. All days prior to 2013-04-01 are given as training data.


 

Exploratory Data Analysis

The descriptions of the data files and the field descriptions are given below.

Field descriptions

File descriptions

Data cleaning

The data from raw csv files is initially loaded into R dataframes.

On early examination of the train dataset, we observed that for a large number of items, no sales were recorded at all. Refer to the image below:

With no past sales data for these combinations, it makes sense to remove them from the training set. On encountering such a combination in the test set, we predict the sales to be zero.
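
The zero-sales filter can be sketched with dplyr. This is a sketch, assuming the training dataframe is named `train` with columns `store_nbr`, `item_nbr`, and `units`:

```r
library(dplyr)

# Find store/item combinations with no recorded sales at all
zero_combos <- train %>%
  group_by(store_nbr, item_nbr) %>%
  summarise(total_units = sum(units), .groups = "drop") %>%
  filter(total_units == 0)

# Drop those combinations from the training set
train <- train %>% anti_join(zero_combos, by = c("store_nbr", "item_nbr"))
# At prediction time, any such combination in the test set is predicted as zero.
```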

Data merging

As discussed above, we first filter out store and item number combinations with zero sales from the training dataset. Using the attributes store_nbr and item_nbr, the key dataframe is joined to the train dataframe. The resulting dataframe is then joined to the weather dataframe using the attributes station_nbr and date. This join links the weather data recorded by the weather station in a particular store’s region to the corresponding sales data for a given date. A similar join was performed on the test dataset to obtain the required features for prediction. The R function merge was used to implement these joins.
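
The joins can be sketched as follows. This assumes dataframes `train`, `key`, and `weather` loaded from the raw csv files, where `key` maps each store (`store_nbr`) to its weather station (`station_nbr`):

```r
# Attach the weather station id to every sales row
train_key <- merge(train, key, by = "store_nbr")

# Attach the weather observations for that station on that date
train_full <- merge(train_key, weather, by = c("station_nbr", "date"))

# The same two joins are applied to the test set to attach weather features.
```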

Feature Engineering

Various important features that influence sales can be extracted from the date attribute, such as day of the week, month, and year. Below is a description of the new columns generated from Date:

  1. Month: Numeric attribute indicating the month.
  2. Year: Factor attribute indicating the year.
  3. Day: Numeric attribute indicating the day of the month.
  4. Weekdays: Numeric attribute indicating the day of the week.
  5. Dayseries: Numeric attribute that indexes the dates, starting at 0 for the oldest date and incrementing by 1 for every subsequent date.
  6. Season: Factor attribute indicating the season, viz. winter, summer, fall, or spring.
  7. is_Holiday: Boolean attribute indicating whether the given date is a holiday: 1 for a holiday, 0 otherwise.
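
The seven features above can be sketched in R. This assumes a Date column `date` in `train` and a hypothetical vector `holiday_dates` listing known holidays:

```r
library(dplyr)

train <- train %>%
  mutate(
    month      = as.numeric(format(date, "%m")),
    year       = factor(format(date, "%Y")),
    day        = as.numeric(format(date, "%d")),
    weekdays   = as.numeric(format(date, "%u")),   # 1 = Monday ... 7 = Sunday
    day_series = as.numeric(date - min(date)),     # 0 for the oldest date
    season     = factor(c("winter", "winter", "spring", "spring", "spring",
                          "summer", "summer", "summer", "fall", "fall",
                          "fall", "winter")[as.numeric(format(date, "%m"))]),
    is_holiday = as.numeric(date %in% holiday_dates)  # holiday_dates is assumed
  )
```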

Visualization of weather attributes and units sold.

Below are some plots to study relationship between different weather attributes and units sold:

  1. Snowfall vs Units Sold:

It can be observed from the graph that the amount of snowfall received and the units sold have an inverse relationship. This makes sense, since people generally avoid going outdoors to shop on snowy days. When there is no snowfall, the number of items sold is highest.

  2. Precipitation vs Units Sold:

Precipitation covers rain, snow, sleet, freezing rain, and hail. While plotting, we removed outliers by dropping datapoints exceeding 1.5 times the interquartile range. Precipitation shows a trend similar to snowfall: on a clear sunny day, the sale of items is at its maximum. Since precipitation accounts for many other adverse weather conditions along with snowfall, it makes sense to include precipitation instead of snowfall.

  3. Daily maximum temperature vs Units Sold:

The tmax attribute gives the daily maximum temperature. Looking at the graph, we can conclude that the sale of items is lower at extreme temperatures, while days with temperatures between 40 and 90 degrees recorded high sales.

  4. Sealevel vs Units Sold:

Sealevel provides the atmospheric pressure adjusted to sea level. Pressures around 30 inches of mercury show the maximum item sales.
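
The 1.5 × IQR outlier rule used before plotting can be sketched as follows, assuming the numeric vector of interest is `train$units`:

```r
# Keep only points within 1.5 * IQR of the quartiles
q    <- quantile(train$units, c(0.25, 0.75), na.rm = TRUE)
iqr  <- q[2] - q[1]
keep <- train$units >= q[1] - 1.5 * iqr & train$units <= q[2] + 1.5 * iqr
plot_data <- train[keep, ]
```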

Visualization of date attributes and units sold.

1. Items sold per store

The first graph shows the total sales per store, grouped yearly. It can be observed that the sales of items decreased between 2012 and 2014.

2. Items sold per month

The second graph shows the total sales per month. Sales are higher between November and February, the festive season. For the rest of the year, sales remain fairly constant. A small rise in sales can be observed during August, which also happens to be the beginning of the academic term for students.

3. Items sold per day of the month

The third graph shows the variation of sales with the day of the month. Sales are higher at the beginning of the month and fall drastically as the end of the month approaches. This makes sense logically: salaries are credited at the beginning of the month, increasing people's purchasing power, which gradually decreases as the end of the month approaches.

4. Items sold per day of the week

The last graph shows the variation of sales with the day of the week. Sales are low during weekdays and increase on Fridays and weekends, which again makes sense.

Correlation Heatmap

The figure above shows the correlation between the available weather parameters and units sold. Blue tiles show positive correlation, while red tiles show negative correlation. The brightness of a tile indicates the strength of the correlation. It can be observed that units sold has a very weak correlation with the weather parameters, meaning that most of the weather parameters do not have a strong impact on the sales of items.

Linear Regression Model / Diagnostics

Our baseline model contains all the available information regarding weather, item, store, etc., except date, station_nbr, and is_weekend, because these details are available via other predictors in our model. From the table above we can see that this model is not good, for multiple reasons. Various predictors such as \(dewpoint\), \(snowfall\), and \(resultdir\) are not significant, that is, their \(p\)-values are greater than \(0.05\). The Kaggle score of our baseline model comes out to be around \(0.51477\), which indicates flaws in this model.

Below is the summary of the regression model:

Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.20257 0.27217 8.09260 0.00000
store_nbr2 0.30477 0.02127 14.32540 0.00000
store_nbr3 0.13221 0.02089 6.32764 0.00000
store_nbr4 0.57779 0.01967 29.38090 0.00000
store_nbr5 0.06247 0.01723 3.62580 0.00029
store_nbr6 0.22417 0.02330 9.62077 0.00000
store_nbr7 0.39731 0.01888 21.04921 0.00000
store_nbr8 -0.02076 0.01836 -1.13086 0.25812
store_nbr9 0.13106 0.01732 7.56772 0.00000
store_nbr10 0.25160 0.01661 15.14318 0.00000
store_nbr11 0.13741 0.01742 7.88604 0.00000
store_nbr12 0.26709 0.01699 15.71624 0.00000
store_nbr13 0.25389 0.01773 14.31725 0.00000
store_nbr14 -0.55532 0.01566 -35.46505 0.00000
store_nbr15 0.21512 0.05994 3.58914 0.00033
store_nbr16 0.28407 0.01896 14.98362 0.00000
store_nbr17 1.39480 0.02053 67.93946 0.00000
store_nbr18 -0.51856 0.02064 -25.12818 0.00000
store_nbr19 -0.08134 0.01689 -4.81543 0.00000
store_nbr20 0.26403 0.02174 12.14472 0.00000
store_nbr21 -0.01301 0.01981 -0.65686 0.51127
store_nbr22 0.23901 0.01704 14.02343 0.00000
store_nbr23 -0.30706 0.01944 -15.79404 0.00000
store_nbr24 0.59487 0.03424 17.37135 0.00000
store_nbr25 0.28383 0.06009 4.72346 0.00000
store_nbr26 -0.01205 0.01905 -0.63237 0.52714
store_nbr27 0.13359 0.01735 7.70078 0.00000
store_nbr28 -0.16817 0.02243 -7.49795 0.00000
store_nbr29 -0.07319 0.02095 -3.49387 0.00048
store_nbr30 0.19924 0.01776 11.21805 0.00000
store_nbr31 0.25477 0.01683 15.13558 0.00000
store_nbr32 0.06291 0.05991 1.05005 0.29369
store_nbr33 0.54419 0.01773 30.70151 0.00000
store_nbr34 0.27759 0.01690 16.42893 0.00000
store_nbr35 0.06704 0.01746 3.84044 0.00012
store_nbr36 -0.23342 0.02008 -11.62519 0.00000
store_nbr37 0.02381 0.06017 0.39572 0.69231
store_nbr38 0.29687 0.02023 14.67635 0.00000
store_nbr39 -2.03570 0.02111 -96.45544 0.00000
store_nbr40 0.06149 0.05992 1.02620 0.30480
store_nbr41 0.22222 0.01659 13.39459 0.00000
store_nbr42 -0.02294 0.02363 -0.97070 0.33170
store_nbr43 0.15203 0.01710 8.88940 0.00000
store_nbr44 0.17443 0.01708 10.21076 0.00000
store_nbr45 -0.62334 0.01951 -31.95673 0.00000
item_nbr2 -0.13913 0.02897 -4.80238 0.00000
item_nbr3 -0.41859 0.03422 -12.23095 0.00000
item_nbr4 0.19276 0.03524 5.46912 0.00000
item_nbr5 3.03823 0.02574 118.02733 0.00000
item_nbr6 2.28120 0.04470 51.02812 0.00000
item_nbr7 -0.60533 0.03488 -17.35576 0.00000
item_nbr8 2.77569 0.03591 77.29613 0.00000
item_nbr9 2.93084 0.02501 117.18075 0.00000
item_nbr10 -0.34975 0.03379 -10.35116 0.00000
item_nbr11 -0.53993 0.03399 -15.88507 0.00000
item_nbr12 -0.59794 0.03295 -18.14718 0.00000
item_nbr13 -0.51168 0.03425 -14.94154 0.00000
item_nbr14 -0.65384 0.03572 -18.30486 0.00000
item_nbr15 -0.32652 0.02702 -12.08455 0.00000
item_nbr16 2.59867 0.02649 98.08151 0.00000
item_nbr17 -0.44682 0.03272 -13.65637 0.00000
item_nbr18 -0.52259 0.03475 -15.03924 0.00000
item_nbr19 -0.35230 0.03288 -10.71612 0.00000
item_nbr20 -0.13591 0.03442 -3.94879 0.00008
item_nbr21 -0.29554 0.03010 -9.81776 0.00000
item_nbr22 0.08940 0.03524 2.53646 0.01120
item_nbr23 2.04034 0.03053 66.82620 0.00000
item_nbr24 0.03815 0.03439 1.10937 0.26727
item_nbr25 3.18516 0.02988 106.58792 0.00000
item_nbr26 1.31514 0.03524 37.31467 0.00000
item_nbr27 1.16641 0.03482 33.49651 0.00000
item_nbr28 1.05815 0.03429 30.85946 0.00000
item_nbr29 0.18132 0.03489 5.19766 0.00000
item_nbr30 0.22079 0.02798 7.89202 0.00000
item_nbr31 -0.55337 0.03353 -16.50575 0.00000
item_nbr32 -0.34188 0.03442 -9.93269 0.00000
item_nbr33 0.14312 0.03339 4.28649 0.00002
item_nbr34 0.05270 0.03524 1.49532 0.13483
item_nbr35 -0.00690 0.03509 -0.19679 0.84399
item_nbr36 3.58904 0.03074 116.74718 0.00000
item_nbr37 2.00968 0.02714 74.05959 0.00000
item_nbr38 -0.08033 0.03313 -2.42471 0.01532
item_nbr39 -0.62877 0.03025 -20.78839 0.00000
item_nbr40 -0.49554 0.03429 -14.45163 0.00000
item_nbr41 2.50208 0.03104 80.61678 0.00000
item_nbr42 0.39355 0.03379 11.64731 0.00000
item_nbr43 2.56430 0.04470 57.36090 0.00000
item_nbr44 3.93058 0.02501 157.15252 0.00000
item_nbr45 3.43035 0.02582 132.84210 0.00000
item_nbr46 -0.49770 0.03423 -14.53808 0.00000
item_nbr47 0.03062 0.03429 0.89297 0.37188
item_nbr48 0.58517 0.03534 16.55667 0.00000
item_nbr49 -0.67579 0.02823 -23.93665 0.00000
item_nbr50 -0.34116 0.02717 -12.55818 0.00000
item_nbr51 -0.51514 0.02737 -18.82051 0.00000
item_nbr52 -0.31619 0.02978 -10.61885 0.00000
item_nbr53 -0.32462 0.03313 -9.79843 0.00000
item_nbr54 -0.54320 0.03414 -15.91107 0.00000
item_nbr55 -0.23454 0.03292 -7.12407 0.00000
item_nbr56 -0.02984 0.03119 -0.95673 0.33871
item_nbr57 0.16612 0.03555 4.67295 0.00000
item_nbr58 -0.25545 0.03288 -7.77002 0.00000
item_nbr59 -0.08262 0.03438 -2.40352 0.01624
item_nbr60 -0.79699 0.04470 -17.82779 0.00000
item_nbr61 -0.43536 0.02838 -15.34009 0.00000
item_nbr62 -0.45956 0.03433 -13.38536 0.00000
item_nbr63 -0.07799 0.03439 -2.26793 0.02333
item_nbr64 -0.59742 0.03426 -17.43706 0.00000
item_nbr65 -0.49318 0.03425 -14.40136 0.00000
item_nbr66 2.10128 0.03439 61.10148 0.00000
item_nbr67 0.66335 0.03349 19.80562 0.00000
item_nbr68 2.75143 0.02726 100.91822 0.00000
item_nbr69 -0.00262 0.03443 -0.07599 0.93943
item_nbr70 0.17486 0.03339 5.23712 0.00000
item_nbr71 -1.73917 0.03534 -49.20773 0.00000
item_nbr72 0.22679 0.03555 6.37956 0.00000
item_nbr73 -0.47817 0.03422 -13.97196 0.00000
item_nbr74 -0.47015 0.03475 -13.53006 0.00000
item_nbr75 -0.24821 0.03443 -7.20895 0.00000
item_nbr76 -1.04509 0.04470 -23.37770 0.00000
item_nbr77 -0.51610 0.03426 -15.06336 0.00000
item_nbr78 -0.51037 0.03414 -14.94954 0.00000
item_nbr79 -0.49735 0.03418 -14.55308 0.00000
item_nbr80 -0.49990 0.03430 -14.57624 0.00000
item_nbr81 0.03478 0.03295 1.05547 0.29121
item_nbr82 -0.28996 0.03602 -8.05069 0.00000
item_nbr83 1.70898 0.03443 49.63584 0.00000
item_nbr84 -0.74336 0.02830 -26.26929 0.00000
item_nbr85 -0.94493 0.02816 -33.55775 0.00000
item_nbr86 -0.74029 0.02656 -27.87227 0.00000
item_nbr87 -0.20620 0.03431 -6.00937 0.00000
item_nbr88 -0.23080 0.02997 -7.70050 0.00000
item_nbr89 -0.53845 0.03429 -15.70314 0.00000
item_nbr90 -0.62211 0.03482 -17.86558 0.00000
item_nbr91 -0.26371 0.03447 -7.65001 0.00000
item_nbr92 -0.77514 0.03352 -23.12829 0.00000
item_nbr93 -0.50973 0.02539 -20.07513 0.00000
item_nbr94 -0.30198 0.03433 -8.79543 0.00000
item_nbr95 0.57535 0.03488 16.49606 0.00000
item_nbr96 -1.06208 0.03482 -30.50037 0.00000
item_nbr97 -1.04388 0.03482 -29.97760 0.00000
item_nbr98 -0.52685 0.02976 -17.70042 0.00000
item_nbr99 -0.38210 0.03429 -11.14344 0.00000
item_nbr100 -0.28132 0.03438 -8.18368 0.00000
item_nbr101 -0.96979 0.03482 -27.84985 0.00000
item_nbr102 -0.71879 0.03408 -21.09320 0.00000
item_nbr103 1.57786 0.03573 44.15768 0.00000
item_nbr104 -0.71397 0.02759 -25.88257 0.00000
item_nbr105 -0.63953 0.02695 -23.73134 0.00000
item_nbr106 -0.72823 0.02991 -24.34534 0.00000
item_nbr107 -0.54561 0.03572 -15.27505 0.00000
item_nbr108 -0.21445 0.03410 -6.28879 0.00000
item_nbr109 -0.02651 0.03005 -0.88230 0.37762
item_nbr110 -0.42641 0.03431 -12.42696 0.00000
item_nbr111 1.54725 0.03573 43.30121 0.00000
tmax -0.00269 0.00038 -7.12875 0.00000
tmin -0.00008 0.00042 -0.20089 0.84078
dewpoint 0.00000 0.00028 0.01216 0.99029
wetbulb 0.00095 0.00048 1.96904 0.04895
heat -0.00014 0.00064 -0.21569 0.82923
cool 0.00254 0.00067 3.76938 0.00016
snowfall -0.00183 0.00484 -0.37761 0.70572
preciptotal 0.04391 0.00502 8.73988 0.00000
stnpressure -0.00088 0.01040 -0.08459 0.93259
sealevel -0.03931 0.01291 -3.04563 0.00232
resultspeed -0.00125 0.00076 -1.64607 0.09975
resultdir -0.00015 0.00014 -1.02775 0.30407
avgspeed 0.00121 0.00087 1.38559 0.16587
month02 -4.80778 0.21453 -22.41113 0.00000
month03 -9.17597 0.41065 -22.34514 0.00000
month04 -13.96321 0.62519 -22.33438 0.00000
month05 -18.57670 0.83282 -22.30578 0.00000
month06 -23.33453 1.04729 -22.28087 0.00000
month07 -27.96589 1.25491 -22.28518 0.00000
month08 -32.72793 1.46935 -22.27372 0.00000
month09 -37.55732 1.68391 -22.30367 0.00000
month10 -42.17783 1.89145 -22.29925 0.00000
month11 -46.96653 2.10614 -22.29982 0.00000
month12 -51.52611 2.31363 -22.27071 0.00000
year13 -56.32006 2.52670 -22.28996 0.00000
year14 -112.60552 5.05225 -22.28820 0.00000
day02 -0.19638 0.01216 -16.14451 0.00000
day03 -0.35054 0.01702 -20.59418 0.00000
day04 -0.52692 0.02302 -22.89080 0.00000
day05 -0.63025 0.02942 -21.42230 0.00000
day06 -0.82694 0.03598 -22.98173 0.00000
day07 -0.97482 0.04270 -22.82743 0.00000
day08 -1.14364 0.04947 -23.11574 0.00000
day09 -1.31273 0.05627 -23.33107 0.00000
day10 -1.40344 0.06308 -22.24713 0.00000
day11 -1.57385 0.06993 -22.50538 0.00000
day12 -1.75806 0.07678 -22.89618 0.00000
day13 -1.91101 0.08364 -22.84933 0.00000
day14 -2.08284 0.09053 -23.00690 0.00000
day15 -2.23627 0.09742 -22.95583 0.00000
day16 -2.40891 0.10428 -23.09943 0.00000
day17 -2.57709 0.11114 -23.18859 0.00000
day18 -2.72382 0.11801 -23.08066 0.00000
day19 -2.89539 0.12490 -23.18132 0.00000
day20 -3.03859 0.13182 -23.05117 0.00000
day21 -3.20127 0.13869 -23.08149 0.00000
day22 -3.37318 0.14561 -23.16652 0.00000
day23 -3.51796 0.15251 -23.06643 0.00000
day24 -3.66229 0.15942 -22.97329 0.00000
day25 -3.85695 0.16631 -23.19070 0.00000
day26 -3.98035 0.17325 -22.97491 0.00000
day27 -4.13572 0.18016 -22.95534 0.00000
day28 -4.29873 0.18709 -22.97739 0.00000
day29 -4.45766 0.19387 -22.99341 0.00000
day30 -4.60301 0.20092 -22.90950 0.00000
day31 -4.74245 0.20791 -22.81030 0.00000
weekdaysMonday -0.09288 0.00475 -19.55286 0.00000
weekdaysTuesday -0.15502 0.00476 -32.59206 0.00000
weekdaysWednesday -0.18705 0.00475 -39.38681 0.00000
weekdaysThursday -0.18067 0.00476 -37.98422 0.00000
weekdaysFriday -0.11609 0.00476 -24.39134 0.00000
weekdaysSaturday -0.02548 0.00478 -5.33550 0.00000
day_series 0.15395 0.00692 22.25046 0.00000
is_holiday -0.08612 0.00633 -13.60435 0.00000

From the summary, it can be observed that most of the weather features have high p-values, which further supports our hypothesis that weather attributes do not have a strong effect on sales. Using p-values as a reference, we decided to include tmax, preciptotal, and sealevel as predictors in our model.
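
A reduced model with only these weather predictors can be sketched as follows, assuming the merged training dataframe is named `train_data` and, as in the Breusch-Pagan output later in the report, the response is log(units + 1):

```r
# Refit keeping only the significant weather predictors plus the date features
fit_reduced <- lm(log(units + 1) ~ store_nbr + item_nbr + tmax + preciptotal +
                    sealevel + month + year + day + weekdays + is_holiday,
                  data = train_data)
summary(fit_reduced)
```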

Let’s take a look at the diagnostic plots:

1. Residuals vs Fitted

The first diagnostic plot is used to identify non-linear patterns between residuals and fitted values. Looking at the plot, a specific pattern appears to be emerging; that is, the points are not randomly distributed around the line. Hence the linearity assumption does not hold in this case.

2. Normal Q-Q Plot

The Q-Q plot is used to test the normality assumption of the residuals. It can be observed that the residuals are not normally distributed. We need to address this issue before we can rely on inference from the linear model.

3. Scale Location Plot

This plot is used to test the homoscedasticity of the errors. A horizontal line with equally distributed points indicates homoscedasticity. Looking at the plot for our data, we see that the errors are not homoscedastic.

4. Residuals vs Leverage plot

This plot helps us identify influential points using Cook's distance. Since no points lie outside the Cook's distance threshold, we conclude that there are no influential points.

Our Kaggle score for the baseline model is \(0.51477\). It is very close to the all-zeros solution, as we have not done any feature selection and simply ran the model on the entire dataset. This results in predicted values very close to \(0\), as the model is not confident in its predictions.
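
The variance inflation factors below can be reproduced with the faraway package (loaded at the top of the report). This is a sketch, assuming the fitted baseline model is stored in `fit_full`:

```r
library(faraway)

# VIF for every coefficient of the baseline linear model;
# values far above 10 indicate strong multicollinearity
vif(fit_full)
```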

##        store_nbr2        store_nbr3        store_nbr4        store_nbr5 
##      5.151714e+00      6.079205e+00      6.692746e+00      4.097466e+00 
##        store_nbr6        store_nbr7        store_nbr8        store_nbr9 
##      4.962212e+00      4.286724e+00      4.817230e+00      4.408312e+00 
##       store_nbr10       store_nbr11       store_nbr12       store_nbr13 
##      3.187772e+00      4.237393e+00      3.419900e+00      3.783850e+00 
##       store_nbr14       store_nbr15       store_nbr16       store_nbr17 
##      4.789995e+00      7.437912e+01      6.342243e+00      4.129591e+00 
##       store_nbr18       store_nbr19       store_nbr20       store_nbr21 
##      4.207465e+00      5.154341e+00      6.582161e+00      3.744624e+00 
##       store_nbr22       store_nbr23       store_nbr24       store_nbr25 
##      4.053919e+00      3.734261e+00      1.461720e+01      6.570373e+01 
##       store_nbr26       store_nbr27       store_nbr28       store_nbr29 
##      2.700409e+00      3.513809e+00      4.706466e+00      4.186297e+00 
##       store_nbr30       store_nbr31       store_nbr32       store_nbr33 
##      4.258567e+00      4.164823e+00      6.531201e+01      3.732290e+00 
##       store_nbr34       store_nbr35       store_nbr36       store_nbr37 
##      4.196417e+00      5.339873e+00      4.053431e+00      6.588137e+01 
##       store_nbr38       store_nbr39       store_nbr40       store_nbr41 
##      5.567301e+00      4.565187e+00      6.534262e+01      3.799511e+00 
##       store_nbr42       store_nbr43       store_nbr44       store_nbr45 
##      3.843559e+00      4.139713e+00      4.028646e+00      5.621386e+00 
##         item_nbr2         item_nbr3         item_nbr4         item_nbr5 
##      6.023229e+00      2.817508e+00      3.121994e+00      2.728956e+01 
##         item_nbr6         item_nbr7         item_nbr8         item_nbr9 
##      5.064956e+00      2.974187e+00      3.112111e+00      2.466548e+01 
##        item_nbr10        item_nbr11        item_nbr12        item_nbr13 
##      2.854424e+00      2.669674e+00      2.897008e+00      2.688382e+00 
##        item_nbr14        item_nbr15        item_nbr16        item_nbr17 
##      2.948277e+00      9.216833e+00      1.273018e+01      2.856559e+00 
##        item_nbr18        item_nbr19        item_nbr20        item_nbr21 
##      2.856891e+00      2.884092e+00      2.778083e+00      4.257525e+00 
##        item_nbr22        item_nbr23        item_nbr24        item_nbr25 
##      3.121994e+00      4.525871e+00      2.658229e+00      4.249244e+00 
##        item_nbr26        item_nbr27        item_nbr28        item_nbr29 
##      3.121994e+00      3.073090e+00      2.883941e+00      3.042676e+00 
##        item_nbr30        item_nbr31        item_nbr32        item_nbr33 
##      1.991832e+00      2.810143e+00      2.778083e+00      2.801812e+00 
##        item_nbr34        item_nbr35        item_nbr36        item_nbr37 
##      3.121994e+00      2.912818e+00      4.743633e+00      1.045551e+01 
##        item_nbr38        item_nbr39        item_nbr40        item_nbr41 
##      2.928886e+00      4.176377e+00      2.883941e+00      4.586203e+00 
##        item_nbr42        item_nbr43        item_nbr44        item_nbr45 
##      2.854424e+00      5.064956e+00      7.375485e+00      2.501503e+01 
##        item_nbr46        item_nbr47        item_nbr48        item_nbr49 
##      2.865500e+00      2.883941e+00      3.096845e+00      5.701322e+00 
##        item_nbr50        item_nbr51        item_nbr52        item_nbr53 
##      8.866756e+00      7.461361e+00      4.249170e+00      2.928886e+00 
##        item_nbr54        item_nbr55        item_nbr56        item_nbr57 
##      2.763825e+00      2.892322e+00      4.761282e+00      3.159643e+00 
##        item_nbr58        item_nbr59        item_nbr60        item_nbr61 
##      2.884092e+00      2.870615e+00      5.064956e+00      5.901067e+00 
##        item_nbr62        item_nbr63        item_nbr64        item_nbr65 
##      2.764220e+00      2.658229e+00      2.656970e+00      2.688382e+00 
##        item_nbr66        item_nbr67        item_nbr68        item_nbr69 
##      2.658229e+00      2.804649e+00      8.640702e+00      2.748699e+00 
##        item_nbr70        item_nbr71        item_nbr72        item_nbr73 
##      2.801812e+00      3.096845e+00      3.159643e+00      2.817508e+00 
##        item_nbr74        item_nbr75        item_nbr76        item_nbr77 
##      2.856891e+00      2.748699e+00      5.064956e+00      2.656970e+00 
##        item_nbr78        item_nbr79        item_nbr80        item_nbr81 
##      2.763825e+00      2.809484e+00      2.789085e+00      2.897008e+00 
##        item_nbr82        item_nbr83        item_nbr84        item_nbr85 
##      2.997667e+00      2.748699e+00      6.034325e+00      5.498893e+00 
##        item_nbr86        item_nbr87        item_nbr88        item_nbr89 
##      1.178819e+01      2.791940e+00      4.323788e+00      2.883941e+00 
##        item_nbr90        item_nbr91        item_nbr92        item_nbr93 
##      3.073090e+00      2.867841e+00      2.710865e+00      4.048164e+01 
##        item_nbr94        item_nbr95        item_nbr96        item_nbr97 
##      2.764220e+00      2.974187e+00      3.073090e+00      6.121081e+00 
##        item_nbr98        item_nbr99       item_nbr100       item_nbr101 
##      4.245740e+00      2.883941e+00      2.870615e+00      3.073090e+00 
##       item_nbr102       item_nbr103       item_nbr104       item_nbr105 
##      2.747546e+00      3.313084e+00      7.506241e+00      1.033371e+01 
##       item_nbr106       item_nbr107       item_nbr108       item_nbr109 
##      4.486464e+00      2.948277e+00      2.726780e+00      4.299551e+00 
##       item_nbr110       item_nbr111              tmax              tmin 
##      2.791940e+00      3.313084e+00      3.270896e+01      3.908446e+01 
##          dewpoint           wetbulb              heat              cool 
##      1.884055e+01      3.991413e+01      4.757838e+01      1.560751e+01 
##          snowfall       preciptotal       stnpressure          sealevel 
##      1.046240e+00      1.145537e+00      2.531872e+02      3.701896e+00 
##       resultspeed         resultdir          avgspeed           month02 
##      6.124752e+00      1.153738e+00      6.926234e+00      2.284637e+03 
##           month03           month04           month05           month06 
##      9.095118e+03      1.962397e+04      3.414125e+04      5.162643e+04 
##           month07           month08           month09           month10 
##      7.504148e+04      1.090469e+05      1.333416e+05      1.839436e+05 
##           month11           month12            year13            year14 
##      1.585655e+05      1.954157e+05      8.948611e+05      3.162378e+06 
##             day02             day03             day04             day05 
##      2.887618e+00      5.734152e+00      1.037695e+01      1.694440e+01 
##             day06             day07             day08             day09 
##      2.594271e+01      3.628287e+01      4.849377e+01      6.264116e+01 
##             day10             day11             day12             day13 
##      7.889220e+01      9.553245e+01      1.154619e+02      1.370900e+02 
##             day14             day15             day16             day17 
##      1.585752e+02      1.827173e+02      2.114409e+02      2.459238e+02 
##             day18             day19             day20             day21 
##      2.807471e+02      3.175667e+02      3.521279e+02      3.961864e+02 
##             day22             day23             day24             day25 
##      4.406867e+02      4.802807e+02      5.257317e+02      5.516347e+02 
##             day26             day27             day28             day29 
##      6.100958e+02      6.537563e+02      7.025292e+02      6.941209e+02 
##             day30             day31    weekdaysMonday   weekdaysTuesday 
##      7.125475e+02      4.940941e+02      1.736234e+00      1.723766e+00 
## weekdaysWednesday  weekdaysThursday    weekdaysFriday  weekdaysSaturday 
##      1.731551e+00      1.731917e+00      1.736636e+00      1.740232e+00 
##        day_series        is_holiday 
##      2.672452e+06      1.144276e+00

From the VIF table we also see that features such as \(day\_series\), \(stnpressure\), \(cool\), and \(heat\) have very high Variance Inflation Factor values.

AIC

The Akaike information criterion (AIC) is an estimate of the relative quality of each possible model. The formula for the AIC of a model is \(AIC = 2k - 2\ln(L)\), where k is the number of parameters in the model and L is the likelihood of the model. We decided to use the backward method, as the forward method takes much more time. The step function, which selects the model based on AIC, works as follows: excluding one variable at a time, it computes the AIC of the model without that particular variable. After calculating all of these AICs, it drops the variable whose removal yields the lowest AIC. This is repeated until no removal improves the AIC.
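
The backward search can be run with MASS::stepAIC (the MASS package is loaded at the top of the report). This is a sketch, assuming the full baseline model is stored in `fit_full`:

```r
library(MASS)

# Backward elimination: repeatedly drop the variable whose removal lowers AIC
fit_aic <- stepAIC(fit_full, direction = "backward", trace = FALSE)
formula(fit_aic)   # the reduced model retained by AIC
```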

From this table we can see that the model was reduced to 13 variables based on AIC. Variables such as tmin, stnpressure, dewpoint, snowfall, heat, avgspeed, and resultspeed were excluded from the model. Thus, we should consider removing some of these variables.

Lasso

We decided to use LASSO regression instead of ridge regression because, along with regularization (variance reduction), LASSO regression performs variable selection by adding a penalty on the absolute value of the magnitude of the coefficients. This shrinks the coefficients of less contributive variables to exactly zero.

From the above table for the LASSO regression, we can see that the coefficients for tmin, dewpoint, heat, snowfall, resultspeed, and avgspeed were set to zero. This indicates that these variables contribute little to predicting the number of units purchased.
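
The LASSO fit can be sketched with glmnet (alpha = 1 selects the LASSO penalty). This is a sketch, assuming the merged training dataframe is named `train_data`:

```r
library(glmnet)

# Build the predictor matrix from the weather variables and drop the intercept
x <- model.matrix(~ store_nbr + item_nbr + tmax + tmin + dewpoint + wetbulb +
                    heat + cool + snowfall + preciptotal + stnpressure +
                    sealevel + resultspeed + resultdir + avgspeed,
                  data = train_data)[, -1]
y <- log(train_data$units + 1)

# Cross-validated LASSO; coefficients shrunk to zero drop out of the model
cv_fit <- cv.glmnet(x, y, alpha = 1)
coef(cv_fit, s = "lambda.min")
```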

Tukey Test

To evaluate the importance of the categorical variables, we use the Tukey test. The Tukey test tells us whether there is a difference between the different levels of a variable and whether we should keep it. From Appendix 1, we can observe that the p-values for most of the intervals of the store number variable are less than 0.05. Thus, we conclude that the store number variable is important, which is what we expected. The same analysis applies to the item number variable.

From the Appendix, we can see that the p-values for almost all pairwise comparisons of weekdays are less than 0.05, indicating that the weekdays differ significantly from each other. The same procedure is applied to the month, year and day variables, with similar results. Thus, we keep these four variables in the model.
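The mechanics of the test can be sketched on a built-in dataset (warpbreaks, not the project data): TukeyHSD compares every pair of factor levels after an ANOVA fit, reporting a confidence interval and adjusted p-value per pair.

```r
# Tukey's honest significant differences for all pairs of tension levels.
fit <- aov(breaks ~ tension, data = warpbreaks)
TukeyHSD(fit)
# Pairs whose interval excludes 0 (p adj < 0.05) differ significantly,
# which is how the store, item, weekday, month, year and day factors
# were assessed in the Appendix.
```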

Improvements

From the correlation table (shown in the Diagnostics section), we saw that resultspeed, heat, tmin, dewpoint and wetbulb are highly correlated with many other variables. This was confirmed by the VIF results. Also, their p-values in the full model did not indicate significance, and the LASSO regression set their coefficients to zero. For these reasons, we remove these variables from the full model. Based on the AIC, p-value and VIF analyses, the snowfall, stnpressure, resultdir and avgspeed variables will be removed as well.

From the Data Analysis section, we observed that the month, day and weekdays variables follow a pattern, from which we conclude that they can be treated as continuous variables. We therefore convert them to numeric format. This results in faster processing and possibly better results.
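The conversion itself is a one-liner in base R; note that as.POSIXlt fields are zero-based, which is harmless as long as train and test data are treated consistently (as they are in the Appendix code):

```r
# Extract numeric calendar fields from a date string with base R.
d <- as.POSIXlt("2013-07-04")
c(month   = d$mon,    # 0-based: 0 = January, 6 = July
  day     = d$mday,   # 1-based day of month
  weekday = d$wday)   # 0 = Sunday, ..., 6 = Saturday
```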

One possible reason for the non-normal error distribution and non-constant variance is that our model underfits. Thus, we should test whether adding interaction terms and higher-order terms helps. A plausible interaction is between store number and maximum temperature, as the stores are located in different cities and therefore experience different maximum temperatures.

We use the Breusch-Pagan test to check whether the new model has constant error variance. The null and alternative hypotheses are: \(H_0\): the errors have constant variance; \(H_1\): the errors are heteroskedastic.

## 
##  studentized Breusch-Pagan test
## 
## data:  lm(log(units + 1) ~ store_nbr * tmax + item_nbr + tmax + cool +     sealevel + preciptotal + month + year + day + weekdays +     is_holiday, train_data)
## BP = 56010, df = 208, p-value < 2.2e-16

The p-value for the Breusch-Pagan test is less than alpha, so we reject the null hypothesis and conclude that the residuals are still heteroskedastic.
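For reference, a self-contained sketch of the test (lmtest::bptest, as used here) on simulated data whose error variance grows with the predictor by construction:

```r
library(lmtest)
set.seed(1)
# Heteroskedastic by design: the error standard deviation equals x.
x <- runif(500)
y <- 1 + 2 * x + rnorm(500, sd = x)
# bptest regresses the squared residuals on the model terms;
# a small p-value rejects H0 (constant variance).
bptest(lm(y ~ x))
```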

Adding the interaction between store number and cool does not improve the situation either.

## 
##  studentized Breusch-Pagan test
## 
## data:  lm(log(units + 1) ~ store_nbr * cool + item_nbr + tmax + cool +     sealevel + preciptotal + month + year + day + weekdays +     is_holiday, train_data)
## BP = 55437, df = 208, p-value < 2.2e-16

Adding second-order terms for tmax and preciptotal does not make the error variance constant either.

## 
##  studentized Breusch-Pagan test
## 
## data:  lm(log(units + 1) ~ store_nbr + item_nbr + I(tmax^2) + cool +     sealevel + preciptotal + month + year + day + weekdays +     is_holiday, train_data)
## BP = 56109, df = 164, p-value < 2.2e-16
## 
##  studentized Breusch-Pagan test
## 
## data:  lm(log(units + 1) ~ store_nbr + item_nbr + I(preciptotal^2) +     cool + sealevel + preciptotal + month + year + day + weekdays +     is_holiday, train_data)
## BP = 55429, df = 164, p-value < 2.2e-16

We do not add interaction or higher-order terms involving store number and item number because, given how many levels these factors have, it is computationally infeasible and would make our models run much slower.

Another possible reason for the non-constant error variance is that some variables are correlated with each other. We therefore fit a Generalized Least Squares (GLS) model to see whether it makes the errors more homoskedastic. The GLS estimator in linear regression is a generalization of the ordinary least squares (OLS) estimator. There are cases when the OLS estimator is not BLUE (the best linear unbiased estimator), mainly because the errors are heteroskedastic or correlated. In such situations, provided that the other assumptions hold, the GLS estimator can be used.

From the above plot, we see that the errors still have non-constant variance, so the GLS model did not help.
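The report's gls fit uses the default (identity) variance structure; a minimal sketch of what an explicit variance model looks like in nlme is below, on simulated data rather than the Walmart data. varPower (not used in the report's fit) lets the residual variance grow as a power of a covariate:

```r
library(nlme)
set.seed(1)
# Simulated data with error spread increasing in x.
d <- data.frame(x = runif(200))
d$y <- 1 + 2 * d$x + rnorm(200, sd = 0.2 + d$x)
# GLS with a power-of-covariate variance function.
fit <- gls(y ~ x, data = d, weights = varPower(form = ~ x))
summary(fit)$tTable
```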

As the number of observations is very high, we could not run a Generalized Linear Model to test whether a GLM would make the errors normally distributed: the glm call for our model crashes RStudio.

It has been shown that the t-test and least-squares linear regression do not require any assumption of normally distributed errors in sufficiently large samples. As our dataset is quite large, we accept non-normally distributed errors and use the following model as our best linear model.

Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.50902 0.23555 10.65159 0.00000
store_nbr2 0.30478 0.01708 17.84468 0.00000
store_nbr3 0.13313 0.01659 8.02275 0.00000
store_nbr4 0.57716 0.01879 30.71056 0.00000
store_nbr5 0.06524 0.01703 3.83093 0.00013
store_nbr6 0.22417 0.01958 11.44851 0.00000
store_nbr7 0.39878 0.01842 21.64466 0.00000
store_nbr8 -0.01990 0.01740 -1.14375 0.25273
store_nbr9 0.13071 0.01648 7.93181 0.00000
store_nbr10 0.25437 0.01640 15.51220 0.00000
store_nbr11 0.13875 0.01714 8.09714 0.00000
store_nbr12 0.26998 0.01683 16.03853 0.00000
store_nbr13 0.25536 0.01723 14.81813 0.00000
store_nbr14 -0.55497 0.01576 -35.21837 0.00000
store_nbr15 0.21634 0.01498 14.44432 0.00000
store_nbr16 0.28424 0.01633 17.40057 0.00000
store_nbr17 1.39428 0.01970 70.76909 0.00000
store_nbr18 -0.51890 0.01999 -25.96097 0.00000
store_nbr19 -0.08344 0.01681 -4.96493 0.00000
store_nbr20 0.26496 0.01767 14.99899 0.00000
store_nbr21 -0.01304 0.01898 -0.68710 0.49202
store_nbr22 0.24034 0.01674 14.35617 0.00000
store_nbr23 -0.30740 0.01873 -16.41119 0.00000
store_nbr24 0.59424 0.03393 17.51189 0.00000
store_nbr25 0.28505 0.01559 18.28833 0.00000
store_nbr26 -0.01239 0.01832 -0.67643 0.49877
store_nbr27 0.13493 0.01706 7.91086 0.00000
store_nbr28 -0.16724 0.01852 -9.03124 0.00000
store_nbr29 -0.07321 0.02018 -3.62767 0.00029
store_nbr30 0.19881 0.01676 11.86486 0.00000
store_nbr31 0.25443 0.01596 15.94088 0.00000
store_nbr32 0.06413 0.01487 4.31324 0.00002
store_nbr33 0.54416 0.01676 32.47460 0.00000
store_nbr34 0.27725 0.01603 17.29652 0.00000
store_nbr35 0.07822 0.01647 4.75033 0.00000
store_nbr36 -0.23543 0.01967 -11.96608 0.00000
store_nbr37 0.02503 0.01590 1.57435 0.11541
store_nbr38 0.29687 0.01573 18.86825 0.00000
store_nbr39 -2.03146 0.02092 -97.08759 0.00000
store_nbr40 0.06272 0.01493 4.20189 0.00003
store_nbr41 0.22499 0.01637 13.74162 0.00000
store_nbr42 -0.02294 0.01998 -1.14777 0.25106
store_nbr43 0.15492 0.01694 9.14321 0.00000
store_nbr44 0.17720 0.01688 10.49785 0.00000
store_nbr45 -0.62300 0.01964 -31.72033 0.00000
item_nbr2 -0.13913 0.02920 -4.76507 0.00000
item_nbr3 -0.41859 0.03449 -12.13592 0.00000
item_nbr4 0.19276 0.03552 5.42663 0.00000
item_nbr5 3.03823 0.02594 117.11037 0.00000
item_nbr6 2.28120 0.04505 50.63168 0.00000
item_nbr7 -0.60533 0.03515 -17.22092 0.00000
item_nbr8 2.77569 0.03619 76.69562 0.00000
item_nbr9 2.93084 0.02521 116.27037 0.00000
item_nbr10 -0.34975 0.03405 -10.27075 0.00000
item_nbr11 -0.53993 0.03426 -15.76166 0.00000
item_nbr12 -0.59794 0.03321 -18.00620 0.00000
item_nbr13 -0.51168 0.03451 -14.82546 0.00000
item_nbr14 -0.65384 0.03600 -18.16265 0.00000
item_nbr15 -0.32652 0.02723 -11.99067 0.00000
item_nbr16 2.59867 0.02670 97.31951 0.00000
item_nbr17 -0.44682 0.03297 -13.55027 0.00000
item_nbr18 -0.52259 0.03502 -14.92240 0.00000
item_nbr19 -0.35230 0.03313 -10.63287 0.00000
item_nbr20 -0.13591 0.03469 -3.91811 0.00009
item_nbr21 -0.29554 0.03034 -9.74148 0.00000
item_nbr22 0.08940 0.03552 2.51675 0.01184
item_nbr23 2.04034 0.03077 66.30703 0.00000
item_nbr24 0.03815 0.03466 1.10075 0.27101
item_nbr25 3.18516 0.03012 105.75983 0.00000
item_nbr26 1.31514 0.03552 37.02478 0.00000
item_nbr27 1.16641 0.03509 33.23627 0.00000
item_nbr28 1.05815 0.03456 30.61971 0.00000
item_nbr29 0.18132 0.03516 5.15728 0.00000
item_nbr30 0.22079 0.02820 7.83071 0.00000
item_nbr31 -0.55337 0.03379 -16.37752 0.00000
item_nbr32 -0.34188 0.03469 -9.85552 0.00000
item_nbr33 0.14312 0.03365 4.25319 0.00002
item_nbr34 0.05270 0.03552 1.48371 0.13789
item_nbr35 -0.00690 0.03536 -0.19526 0.84519
item_nbr36 3.58904 0.03098 115.84016 0.00000
item_nbr37 2.00968 0.02735 73.48422 0.00000
item_nbr38 -0.08033 0.03339 -2.40587 0.01613
item_nbr39 -0.62877 0.03048 -20.62689 0.00000
item_nbr40 -0.49554 0.03456 -14.33936 0.00000
item_nbr41 2.50208 0.03128 79.99047 0.00000
item_nbr42 0.39355 0.03405 11.55683 0.00000
item_nbr43 2.56430 0.04505 56.91526 0.00000
item_nbr44 3.93058 0.02521 155.93160 0.00000
item_nbr45 3.43035 0.02602 131.81004 0.00000
item_nbr46 -0.49770 0.03450 -14.42513 0.00000
item_nbr47 0.03062 0.03456 0.88603 0.37560
item_nbr48 0.58517 0.03562 16.42804 0.00000
item_nbr49 -0.67579 0.02845 -23.75069 0.00000
item_nbr50 -0.34116 0.02738 -12.46061 0.00000
item_nbr51 -0.51514 0.02759 -18.67430 0.00000
item_nbr52 -0.31619 0.03001 -10.53635 0.00000
item_nbr53 -0.32462 0.03339 -9.72231 0.00000
item_nbr54 -0.54320 0.03441 -15.78746 0.00000
item_nbr55 -0.23454 0.03318 -7.06872 0.00000
item_nbr56 -0.02984 0.03143 -0.94929 0.34247
item_nbr57 0.16612 0.03583 4.63664 0.00000
item_nbr58 -0.25545 0.03313 -7.70966 0.00000
item_nbr59 -0.08262 0.03465 -2.38485 0.01709
item_nbr60 -0.79699 0.04505 -17.68929 0.00000
item_nbr61 -0.43536 0.02860 -15.22091 0.00000
item_nbr62 -0.45956 0.03460 -13.28137 0.00000
item_nbr63 -0.07799 0.03466 -2.25031 0.02443
item_nbr64 -0.59742 0.03453 -17.30159 0.00000
item_nbr65 -0.49318 0.03451 -14.28948 0.00000
item_nbr66 2.10128 0.03466 60.62678 0.00000
item_nbr67 0.66335 0.03376 19.65175 0.00000
item_nbr68 2.75143 0.02748 100.13418 0.00000
item_nbr69 -0.00262 0.03470 -0.07540 0.93990
item_nbr70 0.17486 0.03365 5.19643 0.00000
item_nbr71 -1.73917 0.03562 -48.82544 0.00000
item_nbr72 0.22679 0.03583 6.33000 0.00000
item_nbr73 -0.47817 0.03449 -13.86341 0.00000
item_nbr74 -0.47015 0.03502 -13.42494 0.00000
item_nbr75 -0.24821 0.03470 -7.15294 0.00000
item_nbr76 -1.04509 0.04505 -23.19608 0.00000
item_nbr77 -0.51610 0.03453 -14.94633 0.00000
item_nbr78 -0.51037 0.03441 -14.83340 0.00000
item_nbr79 -0.49735 0.03444 -14.44002 0.00000
item_nbr80 -0.49990 0.03456 -14.46299 0.00000
item_nbr81 0.03478 0.03321 1.04727 0.29498
item_nbr82 -0.28996 0.03630 -7.98814 0.00000
item_nbr83 1.70898 0.03470 49.25022 0.00000
item_nbr84 -0.74336 0.02852 -26.06520 0.00000
item_nbr85 -0.94493 0.02838 -33.29704 0.00000
item_nbr86 -0.74029 0.02677 -27.65573 0.00000
item_nbr87 -0.20620 0.03458 -5.96268 0.00000
item_nbr88 -0.23080 0.03021 -7.64068 0.00000
item_nbr89 -0.53845 0.03456 -15.58114 0.00000
item_nbr90 -0.62211 0.03509 -17.72678 0.00000
item_nbr91 -0.26371 0.03474 -7.59058 0.00000
item_nbr92 -0.77514 0.03378 -22.94861 0.00000
item_nbr93 -0.50973 0.02559 -19.91917 0.00000
item_nbr94 -0.30198 0.03460 -8.72710 0.00000
item_nbr95 0.57535 0.03515 16.36791 0.00000
item_nbr96 -1.06208 0.03509 -30.26341 0.00000
item_nbr97 -1.04388 0.03509 -29.74470 0.00000
item_nbr98 -0.52685 0.03000 -17.56290 0.00000
item_nbr99 -0.38210 0.03456 -11.05687 0.00000
item_nbr100 -0.28132 0.03465 -8.12010 0.00000
item_nbr101 -0.96979 0.03509 -27.63349 0.00000
item_nbr102 -0.71879 0.03434 -20.92933 0.00000
item_nbr103 1.57786 0.03601 43.81462 0.00000
item_nbr104 -0.71397 0.02780 -25.68149 0.00000
item_nbr105 -0.63953 0.02716 -23.54697 0.00000
item_nbr106 -0.72823 0.03015 -24.15620 0.00000
item_nbr107 -0.54561 0.03600 -15.15638 0.00000
item_nbr108 -0.21445 0.03437 -6.23993 0.00000
item_nbr109 -0.02651 0.03029 -0.87544 0.38133
item_nbr110 -0.42641 0.03458 -12.33041 0.00000
item_nbr111 1.54725 0.03601 42.96480 0.00000
month -0.00592 0.00040 -14.76957 0.00000
weekdays -0.00508 0.00065 -7.87230 0.00000
day -0.00453 0.00015 -31.12710 0.00000
year13 -0.10049 0.00303 -33.19556 0.00000
year14 -0.19283 0.00323 -59.74351 0.00000
tmax -0.00231 0.00011 -20.63336 0.00000
cool 0.00382 0.00026 14.50828 0.00000
preciptotal 0.05127 0.00484 10.59565 0.00000
sealevel -0.05174 0.00773 -6.69708 0.00000
is_holiday -0.07720 0.00606 -12.73261 0.00000

The p-value for this model is less than 0.05 and the adjusted \(R^2\) equals 0.8760862. The p-value for each retained variable is also significant, indicating that we keep only important variables. The Kaggle score for this model is 0.13302.


Extra Models

In the advanced analysis, we use two quite different machine learning models.

Decision Tree

Decision Trees (DT) belong to the family of non-parametric supervised learning models and are used for both classification and regression. A decision tree builds its logic in the form of a tree by repeatedly breaking the dataset into smaller subsets while growing the tree.

The process of subsetting the data into smaller chunks begins with splitting: each split is formed on a particular variable. Pruning is the process of shortening branches of the tree, done by turning some branches into leaf nodes and removing the leaves beneath the original branch. Pruning is required to avoid overfitting, which is less of a problem in our case because we have a large dataset with relatively few parameters.

The question that comes to mind is which tree is best. There can be multiple features and hence multiple candidate splits. In practice, several features are tried for splitting the dataset, and the feature that yields the highest purity is chosen. Entropy is inversely related to purity: if a node is completely pure, that is, all of its points share the same class or value, its entropy is zero. The objective of tree models is therefore to reduce the entropy of the tree.

In brief, a DT is like a flowchart: terminal (leaf) nodes carry the classification or decision values, while inner nodes hold the logic that branches the tree in multiple directions.

Why we used it?

We used DT for multiple reasons:

  • The data we have are very similar to time series data. Although we do not use time series modeling techniques such as ARIMA, we wanted to try models that perform well on time-related data and to examine the forecasting power of DT models.
  • From the diagnostics charts, we saw that the Walmart data do not satisfy the assumptions of a linear model. Decision tree models require no such assumptions and work on data with different distributions.
  • They are computationally efficient.

Data Processing

One of the best parts of tree-based models is that they do not require any elaborate data preprocessing, so we used the same data for the DT that we used for linear regression, with no further processing beyond the basic feature engineering done for the linear models.

Results

From the graph we can see that as the size of the tree increases, the cross-validation error (\(xerror\)) decreases but saturates beyond some value of \(cp\). Here \(cp\) is the complexity parameter: as \(cp\) decreases, the size of the tree increases. We find that a tree of size 22 performs best. The Kaggle score of this model comes out to \(0.15\), similar to that of our linear model. This implies that we need the more robust models that we explore in the next section.
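The grow-then-prune workflow can be sketched with rpart on a built-in dataset (mtcars, not the project data); printcp reports the cross-validated error per subtree size, mirroring the graph discussed above:

```r
library(rpart)
# Grow a deliberately large tree (small cp), as in the report's fit.
fit <- rpart(mpg ~ ., data = mtcars, method = "anova", cp = 0.001)
printcp(fit)  # CP table: one row per subtree, with xerror per size
# Prune back to the subtree with the lowest cross-validated error.
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp)
```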

Xgboost

XGBoost comes from the family of boosting models; the name stands for eXtreme Gradient Boosting. XGBoost is an ensemble method that builds many weak learners and improves their combined performance iteratively: models are fit sequentially, each one reducing the errors left by the previous weak learners. Gradient boosting applies the gradient descent algorithm to minimize the loss across this sequence of models, and it uses tree-based models as the weak learners. Because the data are repeatedly resampled with replacement, one data point can contribute to many of the sequentially built learners. Gradient descent itself is an optimization algorithm that iteratively moves opposite to the direction of the gradient to reach a minimum of a function; it is the core optimization technique in most deep learning algorithms, which, unlike linear regression, have no closed-form solution.
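XGBoost adds many refinements (regularization, parallelism, missing-value handling), but the core loop it implements can be sketched with plain regression trees on toy simulated data, using squared-error loss and a fixed learning rate:

```r
library(rpart)
set.seed(1)
# Toy regression problem (not the Walmart data).
d <- data.frame(x1 = runif(300), x2 = runif(300))
d$y <- sin(2 * pi * d$x1) + d$x2 + rnorm(300, sd = 0.1)

pred <- rep(mean(d$y), nrow(d))  # start from a constant learner
eta  <- 0.1                      # learning rate (shrinkage)
for (m in 1:100) {
  # For squared error, the negative gradient is just the residual.
  d$resid <- d$y - pred
  stump <- rpart(resid ~ x1 + x2, data = d,
                 control = rpart.control(maxdepth = 2))
  # Boost: add a shrunken copy of the new weak learner.
  pred <- pred + eta * predict(stump, d)
}
sqrt(mean((d$y - pred)^2))  # training RMSE shrinks as rounds accumulate
```

This is only the conceptual skeleton; the actual model below is fit with the xgboost package through caret, as shown in the Appendix.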

Why we used it?

  • As mentioned earlier, XGBoost uses decision trees internally as its weak learners. We wanted to highlight the power of boosting techniques in regression tasks. XGBoost also works well on forecasting data and, being a boosted ensemble, is good at avoiding overfitting.
  • One of the main reasons we chose XGBoost is its computing power. Although it creates many trees and parameters, it parallelizes the work across them, so parameter updates are fast. It also handles missing values and applies regularization to reduce overfitting.

Data Processing

We didn’t do any special treatment to the data and used the same format of data as we used to building linear regression models. Although, we removed missing data in preprocessing, ensemble methods like Xgboost don’t require this step.

Results

We used 2-fold cross-validation and found that the best \(RMSE\) comes out to \(0.52\). Our Kaggle score comes out to \(0.10951\), which is considerably better than the other models we used in this analysis. As with rpart, RMSE decreases as tree depth increases.

Appendix

# library(dplyr)
# library(MASS)
# library(imputeTS)
# library(rpart)
# library(glmnet)
# library(zoo)
# library(formattable)
# library(kableExtra)
# library(chron)
# library(tidyverse)
# library(caret)
# library(rpart.plot)
# library(randomForest)
# library(lmtest)
# library(gridExtra)
# library(xgboost)
# library(faraway)
# library(nlme)
# library(corrplot)
# library(imputeTS)
# key_data <- read.csv("data/key.csv")
# train_data <- read.csv("data/train.csv")
# weather_data <- read.csv("data/weather.csv")
# test_data <- read.csv("data/test.csv")
# library("dplyr")
# library("ggplot2")
# sales_stats <- train_data %>% mutate(store_item_combo = paste(.$store_nbr,"-",.$item_nbr)) %>%
#                 group_by(store_item_combo) %>%
#                 summarize(total_items = sum(units)) %>%
#                 mutate(sales_flag = case_when(
#                   .$total_items == 0 ~ 'zero-sales',
#                   TRUE ~ 'non-zero sales'
#                 )
#               ) %>%
#                  group_by(sales_flag) %>%
#                  count()
# ggplot(data=sales_stats, aes(x=sales_flag, y=n)) +
#   geom_bar(stat="identity", width=0.5) + labs(x= "Store Item Combination", y = "Units Sold")
# store_items_cnt <- train_data %>%
#                 group_by(item_nbr, store_nbr) %>%
#                 summarize(total_items = sum(units))
# train_data <- merge(train_data, store_items_cnt[store_items_cnt$total_items != 0, ], by = c("item_nbr", "store_nbr"))
# train_data <- dplyr::select(merge(merge(train_data, key_data, by = "store_nbr"), weather_data, by = c("date", "station_nbr")),
#                -c("total_items")
# )
# test_key_data <- merge(test_data, key_data, by = c("store_nbr"))
# test_data <- merge(test_key_data, weather_data, by = c("station_nbr", "date"))
# numerical_features <- c("tmax", "tmin", "dewpoint", "wetbulb", "heat", "cool", "stnpressure",
#                        "sealevel", "resultspeed", "resultdir", "avgspeed")
# trace_features <- c("snowfall", "preciptotal")
# factor_variables <- c("station_nbr", "item_nbr", "store_nbr")
# dropped_features <- c("codesum", "sunrise", "sunset", "depart", "tavg", "season", "isWeekend")
# holidays <- read.table('data/holidays.txt', sep = " ", fill = TRUE) %>%
#             filter(str_detect(V1, '^20')) %>%
#             mutate(date_holiday = str_c(V1, "-", V2, "-", V3), format = "%Y-%b-%d")
# holidays <- Reduce(c, lapply(holidays$date_holiday,FUN=function(x) as.Date(x, format = "%Y-%b-%d")))
# check_holiday <- function(x) {
#   # Returns TRUE if date x appears in the holidays vector, FALSE otherwise.
#   if (any(holidays == x)) {
#     return(TRUE)
#   }
#   FALSE
# }
# create_features = function(df, is_train=TRUE, drop_columns=TRUE){
#   new_date <- as.Date(df$date, "%Y-%m-%d")
#   df$month <- as.factor(strftime(new_date, "%m"))
#   df$year <- as.factor(strftime(new_date, "%y"))
#   df$day <- as.factor(strftime(new_date, "%d"))
#   df$weekdays <- as.factor(weekdays(as.Date(df$date, "%Y-%m-%d")))
#   df$weekdays <- factor(df$weekdays, levels= c("Sunday", "Monday",
#     "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
#   df$is_weekend <- as.factor(df$weekdays %in% c("Sunday", "Saturday"))
#   min_date <- min(as.Date(df$date),na.rm=TRUE)
#   df$day_series <- as.numeric(as.Date(df$date)-min_date)
#   df$is_holiday <- sapply(as.character(df$date), FUN = function(x) ifelse(any(as.character(holidays) == x), 1, 0))
#   # train_data$holiday_info <- sapply(train_data$date, check_holiday)
#   yq <- as.yearqtr(as.yearmon(df$date, "%Y-%m-%d") + 1/12)
#   df$season <- as.factor(factor(format(yq, "%q"), levels = 1:4, labels = c("winter", "spring", "summer", "fall")))
#   df$date <- as.factor(df$date)
#   if(drop_columns == TRUE){
#    df <- df[, !colnames(df) %in% dropped_features]
#   }
#   df[numerical_features] <- sapply(df[numerical_features], as.character)
#   df[trace_features] <- sapply(df[trace_features], as.character)
#   df[df$preciptotal == "  T", ]$preciptotal <- "0.05"
#   df[df$snowfall == "  T", ]$snowfall <- "0.1"
#   df[df == "M"] <- NA
#   df[numerical_features] <- sapply(df[numerical_features], as.numeric)
#   df[trace_features] <- sapply(df[trace_features], as.numeric)
#   if(is_train == TRUE){
#     df$units <- as.numeric(df$units)
#   }
#   for(i in numerical_features){
#     df[is.na(df[,i]), i] <- median(df[,i], na.rm = TRUE)
#   }
#   for(i in trace_features){
#     df[is.na(df[,i]), i] <- median(df[,i], na.rm = TRUE)
#   }
#   df[factor_variables] <- lapply(df[factor_variables], factor)
#   df
# }
# train_data <- create_features(train_data)
# test_data <- create_features(test_data, is_train = FALSE)
# library(ggplot2)
# #Snowfall vs items sold
# IQR_snowfall=IQR(train_data$snowfall)
# items_sold_snowfall <- train_data[train_data$snowfall < 1,] %>%
#   group_by(snowfall, year) %>%
#   summarise(items_sold = sum(units))
# p1=ggplot(items_sold_snowfall, aes(x = snowfall, y = items_sold,group = year, colour = factor(year))) +
#   geom_line() +
#   geom_point() +
#   labs(x= "snowfall", y = "Units Sold", colour = "Year") +
#   theme_classic()
# #Precipitation vs items sold
# IQR_preciptotal=IQR(train_data$preciptotal)
# items_sold_precip <- train_data[train_data$preciptotal < 1.5*IQR_preciptotal,] %>%
#   group_by(preciptotal, year) %>%
#   summarise(items_sold = sum(units))
# p2 = ggplot(items_sold_precip, aes(x = preciptotal, y = items_sold, group = year, colour = factor(year))) +
#   geom_line() +
#   geom_point() +
#   labs(x= "preciptotal", y = "Units Sold", colour = "Year") +
#   #scale_x_discrete(limits = weekdays(as.Date(4,"1970-01-01",tz="GMT")+0:6)) +
#   theme_classic()
# #tmax vs items sold
# items_sold_by_temperature <- train_data %>%
#   group_by(tmax, year) %>%
#   summarise(items_sold = sum(units))
# p3 = ggplot(items_sold_by_temperature, aes(x =tmax, y = items_sold, group = year, colour = factor(year))) +
#   geom_line() +
#   geom_point() +
#   labs(x= "tmax", y = "Units Sold", colour = "Year") +
#   theme_classic()
# #sealevel vs items sold
# items_sold_by_sealevel <- train_data %>%
#   group_by(sealevel, year) %>%
#   summarise(items_sold = sum(units))
# p4 = ggplot(items_sold_by_sealevel, aes(x =sealevel, y = items_sold, group = year, colour = factor(year))) +
#   geom_line() +
#   geom_point() +
#   labs(x= "sealevel", y = "Units Sold", colour = "Year") +
#   theme_classic()
# grid.arrange(p1,p2,p3,p4)
# col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
# corrplot(cor(train_data[c(numerical_features,trace_features,'units')]), method = "color", col = col(200),
#          type = "upper", order = "hclust", number.cex = .7,
#          addCoef.col = "black", # Add coefficient of correlation
#          tl.col = "black", tl.srt = 90, # Text label color and rotation
#          # Combine with significance
#          p.mat = cor.mtest(train_data[c(numerical_features,trace_features,'units')])$p, sig.level = 0.01, insig = "blank",
#          # hide correlation coefficient on the principal diagonal
#          diag = FALSE)
# baseline_model <- lm(log(units + 1) ~ . -date -station_nbr -is_weekend, train_data)
# knitr::kable(summary(baseline_model)$coef, digits=5) %>%
#   kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
# plot(baseline_model)
# vif(baseline_model)
# aic_b <- step(baseline_model, trace = FALSE)
# X <- model.matrix(baseline_model)
# lasso_baseline_cv <- cv.glmnet(X[, -1], log(train_data$units + 1), family="gaussian", alpha = 1)
# lasso_baseline <- glmnet(X[, -1], log(train_data$units + 1), family="gaussian", alpha = 1, lambda = lasso_baseline_cv$lambda.min)
# aic_b <- step(baseline_model)
# TukeyHSD(aov(units ~ store_nbr, train_data))
# TukeyHSD(aov(units ~ weekdays, train_data))
# TukeyHSD(aov(units ~ month, train_data))
# TukeyHSD(aov(units ~ day, train_data))
# TukeyHSD(aov(units ~ year, train_data))
# train_data$month <- as.POSIXlt(train_data$date)$mon
# train_data$day <- as.POSIXlt(train_data$date)$mday
# train_data$weekdays <- as.POSIXlt(train_data$date)$wday
# test_data$month <- as.POSIXlt(test_data$date)$mon
# test_data$day <- as.POSIXlt(test_data$date)$mday
# test_data$weekdays <- as.POSIXlt(test_data$date)$wday
# bptest(lm(log(units + 1) ~ store_nbr*tmax + item_nbr + tmax + cool  + sealevel+ preciptotal + month + year + day + weekdays + is_holiday, train_data))
# bptest(lm(log(units + 1) ~ store_nbr*cool + item_nbr + tmax + cool  + sealevel+ preciptotal + month + year + day + weekdays + is_holiday,  train_data))
# bptest(lm(log(units + 1) ~ store_nbr+ item_nbr + I(tmax^2) + cool  + sealevel + preciptotal + month + year + day + weekdays+ is_holiday, train_data))
# bptest(lm(log(units + 1) ~ store_nbr+ item_nbr + I(preciptotal^2) + cool  + sealevel + preciptotal + month + year + day + weekdays + is_holiday, train_data))
# gls_model <- gls(log(units + 1) ~ store_nbr + item_nbr + tmax + cool  + sealevel + preciptotal + month + year + day + weekdays + is_holiday, train_data)
# plot(gls_model)
# sample_mod <- lm(log(units + 1) ~ store_nbr + item_nbr + month + weekdays + day + year + tmax + cool + preciptotal + sealevel + is_holiday, train_data)
# knitr::kable(summary(sample_mod)$coef, digits=5) %>%
#   kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
# test_data$date <- as.factor(test_data$date)
# test_data$predictions <- predict(rpart_mod, test_data)
# test_data$predictions <- exp(test_data$predictions) - 1
# test_data$predictions[which(test_data$predictions < 0)] <- 0
# temp <- store_items_cnt[store_items_cnt$total_items > 0 ,]
# merged = merge(test_data, temp, by.x = c("store_nbr", "item_nbr"), by.y = c("store_nbr", "item_nbr"), all.x = TRUE)
# merged[which(is.na(merged$total_items)), "predictions"] = 0
# id <- paste(merged$store_nbr, merged$item_nbr, merged$date, sep = "_")
# write.table(cbind(id, merged$predictions),file="data/results.csv",sep =",", row.names=F,col.names=c('id','units'))
# rpart_mod <- rpart(log(units + 1) ~ store_nbr + item_nbr + month + weekdays + day + year + tmax + cool + preciptotal + sealevel + is_holiday, train_data, cp = 0.001)
# set.seed(42)
# ctrl <- trainControl(method = "cv", number = 2, allowParallel = TRUE, verboseIter = FALSE, returnData = FALSE)
# xgb_mod <- caret::train(
#   log(units + 1) ~ store_nbr + item_nbr + month + weekdays + day + year + tmax + cool + preciptotal + sealevel + is_holiday,
#   train_data,
#   method = "xgbTree",
#   trControl = ctrl
# )
# plotcp(rpart_mod)